{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "202c45d6-38c4-486e-8d88-9838d3da58d2",
   "metadata": {
    "tags": []
   },
   "source": [
    "# VADER Rule-based Classifier Baseline for IMDB"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3185bab5-abee-4e98-af26-5864c94cb495",
   "metadata": {},
   "source": [
    "- The source code can be found here: https://www.nltk.org/_modules/nltk/sentiment/vader.html\n",
    "- The corresponding paper is\n",
    "\n",
    "> Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for\n",
    "Sentiment Analysis of Social Media Text. Eighth International Conference on\n",
    "Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62bb6402-160d-4fb8-bddc-8bfc58b1e40b",
   "metadata": {
    "colab_type": "text",
    "id": "mQMmKUEisW4W"
   },
   "source": [
    "## Download Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59003fc4-84ef-41e1-8f35-8701198d8ce8",
   "metadata": {},
   "source": [
    "The following cells will download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification in as CSV-formatted file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "8142921f-6931-41b4-b6f7-1e4f830577cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2021-11-30 18:07:24--  https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz\n",
      "Resolving github.com (github.com)... 140.82.114.3\n",
      "Connecting to github.com (github.com)|140.82.114.3|:443... connected.\n",
      "HTTP request sent, awaiting response... 302 Found\n",
      "Location: https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz [following]\n",
      "--2021-11-30 18:07:24--  https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch08/movie_data.csv.gz\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 26521894 (25M) [application/octet-stream]\n",
      "Saving to: ‘movie_data.csv.gz’\n",
      "\n",
      "movie_data.csv.gz   100%[===================>]  25.29M  18.1MB/s    in 1.4s    \n",
      "\n",
      "2021-11-30 18:07:26 (18.1 MB/s) - ‘movie_data.csv.gz’ saved [26521894/26521894]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "96583f4a-ce38-445b-b77d-ea7a4df72f8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "!gunzip -f movie_data.csv.gz "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f3e70f3-d9b8-4cd0-841c-4d579d67ad2e",
   "metadata": {},
   "source": [
    "Check that the dataset looks okay:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "2f5700dc-1d5c-4041-a229-6c4b68e64c8e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>review</th>\n",
       "      <th>sentiment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>In 1974, the teenager Martha Moxley (Maggie Gr...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>OK... so... I really like Kris Kristofferson a...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>***SPOILER*** Do not read this, if you think a...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>hi for all the people who have seen this wonde...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>I recently bought the DVD, forgetting just how...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                              review  sentiment\n",
       "0  In 1974, the teenager Martha Moxley (Maggie Gr...          1\n",
       "1  OK... so... I really like Kris Kristofferson a...          0\n",
       "2  ***SPOILER*** Do not read this, if you think a...          0\n",
       "3  hi for all the people who have seen this wonde...          1\n",
       "4  I recently bought the DVD, forgetting just how...          0"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "\n",
    "df = pd.read_csv('movie_data.csv')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ac8a63a0-1093-4328-a454-f01021aca3cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "np.random.seed(123)\n",
    "msk = np.random.rand(len(df)) < 0.85\n",
    "df_train = df[msk]\n",
    "df_test = df[~msk]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dcd84812-738e-4b44-9187-2accab43c622",
   "metadata": {},
   "source": [
    "Baseline always predicting the majority class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7334e264-c229-43ca-9747-f3927aafd079",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test accuracy: 50.21%\n"
     ]
    }
   ],
   "source": [
    "acc = df_train['sentiment'].mean()\n",
    "print(f\"Test accuracy: {acc*100:.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ac4c7aa-d1af-4922-84b1-4a79a14f095f",
   "metadata": {},
   "source": [
    "## Using Vader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e3a86ea-62eb-4547-8330-f0e8c5a142d2",
   "metadata": {},
   "source": [
    "- Note that Vader is rule-based and doesn't require a training set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "4dd9aa7e-000d-45e0-8aa7-dc84f6286d81",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package vader_lexicon to\n",
      "[nltk_data]     /Users/sebastian/nltk_data...\n",
      "[nltk_data]   Package vader_lexicon is already up-to-date!\n",
      "[nltk_data] Downloading package punkt to /Users/sebastian/nltk_data...\n",
      "[nltk_data]   Package punkt is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import nltk\n",
    "\n",
    "nltk.download('vader_lexicon')\n",
    "nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38648713-c97d-4591-91b6-1da675881cd4",
   "metadata": {},
   "source": [
    "### Based on paragraphs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c259d0e9-7bcb-431c-93cc-d7bc6da17de2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n",
    "\n",
    "\n",
    "y_pred = []\n",
    "sid = SentimentIntensityAnalyzer()\n",
    "for row in df_test.iterrows():\n",
    "    \n",
    "    sscore = sid.polarity_scores(row[1]['review'])\n",
    "    if sscore['neg'] >= sscore['pos']:\n",
    "        y_pred.append(0)\n",
    "    else:\n",
    "        y_pred.append(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "fb1fc954-9098-47e0-9ebb-968273c6315b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test accuracy: 69.07%\n"
     ]
    }
   ],
   "source": [
    "acc = (df_test['sentiment'] == y_pred).mean()\n",
    "print(f\"Test accuracy: {acc*100:.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb50d78f-15a2-4657-aaa6-c1aa4904bc05",
   "metadata": {},
   "source": [
    "### Based on majority label among individual sentences in each paragraph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "9190fbe2-dec3-45f7-a8cf-a885967558dc",
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk import tokenize\n",
    "\n",
    "\n",
    "y_pred = []\n",
    "sid = SentimentIntensityAnalyzer()\n",
    "\n",
    "for row in df_test.iterrows():\n",
    "    \n",
    "    sentences = tokenize.sent_tokenize(row[1]['review'])    \n",
    "    sentence_scores = []\n",
    "    \n",
    "    for sentence in sentences:\n",
    "        sscore = sid.polarity_scores(sentence)\n",
    "        if sscore['neg'] >= sscore['pos']:\n",
    "            sentence_scores.append(0)\n",
    "        else:\n",
    "            sentence_scores.append(1)        \n",
    "    mode = np.argmax(np.bincount(sentence_scores))\n",
    "    y_pred.append(mode)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "b460dbfb-2dc8-447f-8bdc-d4d049ab35cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test accuracy: 70.49%\n"
     ]
    }
   ],
   "source": [
    "acc = (df_test['sentiment'] == y_pred).mean()\n",
    "print(f\"Test accuracy: {acc*100:.2f}%\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1ec921a1-df00-4fe0-83a0-d2ffa25cc270",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "numpy : 1.21.2\n",
      "nltk  : 3.6.3\n",
      "pandas: 1.3.2\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark --iversions"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
